Filtering Large Propositional Rule Sets While Retaining Classifier Performance

Authors

Thomas Ågotnes

Turid Follestad

Abstract

Data mining is the problem of inducing models from data. Models have both a descriptive and a predictive aspect. Descriptive models can be inspected and used for knowledge discovery. Models consisting of decision rules, such as those produced by methods from Pawlak’s rough set theory, are in principle descriptive, but in practice the induced models are too large to be inspected. In this thesis, extracting descriptive models from already induced complex models is considered. According to the principle of Occam’s razor, the simpler of two models that are both consistent with the observed data should be chosen. A descriptive model can be found by simplifying a complex model while retaining predictive performance. The approach taken in this thesis is rule filtering, i.e. post-pruning of complete rules from a model. Two methods for finding high-performance subsets of a set of rules are investigated. The first is to use a genetic algorithm to search the space of subsets. The second is to create an ordering of a rule set by sorting the rules according to a quality measure for individual rules. Subsets with a particular cardinality and expected good predictive performance can then be constructed by taking the first rules in the ordering. Algorithms for the two methods have been implemented and are available for general use in the ROSETTA system, a toolkit for data analysis within the framework of rough set theory. Predictive performance is estimated using ROC analysis, and ten different formulas from the literature that can be used to define rule quality are implemented. An extensive experiment on a real-world data set describing patients with suspected acute appendicitis is included. In this study, rule sets consisting of six to twelve rules were found whose estimated predictive performance did not differ significantly from that of full models consisting of between 400 and 500 rules. Another experiment confirms these results.
In the experiments, statistical hypothesis testing was used to assess differences between performance measures derived from ROC analysis.
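The second method described in the abstract (sort the rules by an individual quality measure, then keep the first k) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Rule` class, the toy attributes `pain` and `fever`, the voting scheme, and the quality formula (accuracy times coverage) are hypothetical and are not taken from the thesis or from the ROSETTA system. The abstract mentions ten quality formulas without naming them. The area under the ROC curve is computed here by direct pairwise comparison of scores.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Example = Dict[str, int]


@dataclass
class Rule:
    """A propositional rule: IF all conditions hold THEN decision (0 or 1)."""
    conditions: Dict[str, int]
    decision: int

    def matches(self, x: Example) -> bool:
        return all(x.get(attr) == val for attr, val in self.conditions.items())


def quality(rule: Rule, X: Sequence[Example], y: Sequence[int]) -> float:
    """Assumed quality measure (not from the thesis): accuracy * coverage."""
    matched = [label for x, label in zip(X, y) if rule.matches(x)]
    if not matched:
        return 0.0
    correct = sum(1 for label in matched if label == rule.decision)
    in_class = sum(1 for label in y if label == rule.decision)
    accuracy = correct / len(matched)
    coverage = correct / in_class if in_class else 0.0
    return accuracy * coverage


def filter_rules(rules: List[Rule], X, y, k: int) -> List[Rule]:
    """Order the rules by quality (best first) and keep the first k."""
    ranked = sorted(rules, key=lambda r: quality(r, X, y), reverse=True)
    return ranked[:k]


def score(rules: List[Rule], x: Example) -> float:
    """Fraction of firing rules that vote for class 1; 0.5 if none fire."""
    votes = [r.decision for r in rules if r.matches(x)]
    return sum(votes) / len(votes) if votes else 0.5


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for this sketch only.
X = [{"pain": 1, "fever": 1}, {"pain": 1, "fever": 0},
     {"pain": 0, "fever": 1}, {"pain": 0, "fever": 0}]
y = [1, 1, 0, 0]
rules = [Rule({"pain": 1}, 1), Rule({"pain": 0}, 0),
         Rule({"fever": 1}, 1), Rule({"fever": 0, "pain": 0}, 0)]

top2 = filter_rules(rules, X, y, k=2)
auc_full = auc([score(rules, x) for x in X], y)
auc_top2 = auc([score(top2, x) for x in X], y)
print(len(rules), auc_full, len(top2), auc_top2)
```

For brevity the sketch scores the same examples that were used to rank the rules. A real comparison of a filtered rule set against the full model would rank the rules on one part of the data and estimate ROC performance on a held-out part.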


Similar resources

Finding Small High Performance Subsets of Induced Rule Sets: Extended Summary

Models consisting of decision rules – such as those produced by methods from Pawlak’s rough set theory – generally have a white-box nature, but in practice induced models are too large to be inspected. Here, we investigate methods for simplifying complex models while retaining predictive performance. The approach taken is rule filtering, i.e. post-pruning of complete rules. Two methods for find...

Collaborative Ensemble Learning: Combining Collaborative and Content-Based Information Filtering via Hierarchical Bayes

Collaborative filtering (CF) and content-based filtering (CBF) have been widely used in information filtering applications, both approaches having their individual strengths and weaknesses. This paper proposes a novel probabilistic framework to unify CF and CBF, named collaborative ensemble learning. Based on content-based probabilistic models for each user’s preferences (the CBF idea), it combi...

Voltage Sag Compensation with DVR in Power Distribution System Based on Improved Cuckoo Search Tree-Fuzzy Rule Based Classifier Algorithm

A new technique is presented to improve the performance of the dynamic voltage restorer (DVR) for voltage sag mitigation. This control scheme is based on the cuckoo search algorithm with a tree fuzzy rule based classifier (CSA-TFRC). CSA is used to optimize the output of TFRC so that the classification output of the network is enhanced. While, the combination of cuckoo search algorithm, fuzzy and decision tree...

Learning Rules that Classify E-Mail

Two methods for learning text classifiers are compared on classification problems that might arise in filtering and filing personal e-mail messages: a "traditional IR" method based on TF-IDF weighting, and a new method for learning sets of "keyword-spotting rules" based on the RIPPER rule learning algorithm. It is demonstrated that both methods obtain significant generalizations from a small num...

Sets in homotopy type theory

Homotopy Type Theory may be seen as an internal language for the ∞-category of weak ∞-groupoids which in particular models the univalence axiom. Voevodsky proposes this language for weak ∞-groupoids as a new foundation for mathematics called the Univalent Foundations of Mathematics. It includes the sets as weak ∞-groupoids with contractible connected components, and thereby it includes (much of)...

Publication date: 1999
